BUG: concat of series of dtype category converting to object dtype (GH8641) #8714
Conversation
cc @immerrr, maybe you can think of a better way to do this. It 'feels' a bit hacky.
It does indeed, let me have a closer look; I don't have my Emacs at hand right now.
There are two things that bother me.

One is that concatenation code somehow crept back into the Block class. Concatenating Categoricals (with ndarrays or with other Categoricals) seems a valid scenario regardless of Blocks. It creates those really weird workflows where you need to create Blocks just to concatenate two values (I'm looking at you

```python
def concatenate_arrays(arrays):
    for fn in [pd.core.categorical.maybe_concatenate_categoricals,
               pd.core.sparse.array.maybe_concatenate_sparsearrays]:
        retval = fn(arrays)
        if retval is not None:
            return retval
    else:
        return np.concatenate([maybe_getvalues(a) for a in arrays])
```

This particular PR can probably get away with just fixing `_concat_compat` and leaving the rest of the code intact (if the Series constructor supports Categorical data). And speaking of

The other thing is that Series being doubly special strikes back again: not only does its first axis not allow type variation (special case #1), but there is also an implicit axis=1 which allows Series to be stacked horizontally (special case #2). I wonder how much it would take to stop pretending and make Series internally a DataFrame, and whether that would reduce the amount of special-casing. I also wonder if there's a place under the sun for type-varying (aka non-single-block) Series; one use case that immediately comes to mind is that
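The dispatcher idea above can be sketched as a self-contained toy (the handler names in the comment, like `maybe_concatenate_categoricals`, are hypothetical, and so is the string handler here; this only illustrates the try-each-handler-then-fall-back pattern, not actual pandas internals):

```python
import numpy as np

# Hypothetical type-specific handler: returns a concatenated result for
# inputs it understands, or None to let the next handler try.
def maybe_concat_strings(arrays):
    if all(isinstance(a, str) for a in arrays):
        return "".join(arrays)
    return None

def concatenate_arrays(arrays, handlers=(maybe_concat_strings,)):
    # Try each specialized handler first; whoever returns non-None wins.
    for fn in handlers:
        retval = fn(arrays)
        if retval is not None:
            return retval
    # No handler claimed the inputs: densify and use plain numpy.
    return np.concatenate([np.asarray(a) for a in arrays])
```

The point of the pattern is that concatenation stays a function over array-likes, so no Block machinery is needed just to glue two values together.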
I think your first point is very valid, though (aside from the obvious impl issues) why do you think concatenation should be functional rather than living in Block? Is there a conceptual/fundamental reason, or just the exhibited code complexity? Your second point though, I am confused: axis=1 is not a special case of Series but rather a special case of merge, one that is an impl detail and can be done trivially rather than going thru the merge code.
can u show the DatetimeIndex.hour issue/example from above?
so you are proposing (sort of in between 1 and 2) to effectively remove SingleBlockManager and make it a pure BlockManager. I think you could (and prob not lose perf, though you do lose the opportunity to do some optimizations; not sure how much that would affect it). want to create an issue for that?
The second point is rather a thought experiment and I probably should have not mentioned it here. Let's take it off the table right away and move it to another issue (or probably to the mailing-list). |
re: datetimeindex: Provided I have an array of dates and I want their respective day values, the easiest way to do that is via a DatetimeIndex:

```python
In [21]: arr = np.array([datetime(2014, 1, 1), datetime(2014, 1, 2), datetime(2014, 1, 3)])

In [22]: pd.DatetimeIndex(arr).day
Out[22]: array([1, 2, 3], dtype=int32)
```

I find it weird that I have to create an index to perform an operation on an array. I know, it's a case of Index fixing Array shortcomings, and I haven't yet found time to fix this.
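For reference, the same per-element field access is also reachable without constructing an Index by hand, via `pd.to_datetime` plus the Series `.dt` accessor; a minimal sketch:

```python
import numpy as np
import pandas as pd
from datetime import datetime

arr = np.array([datetime(2014, 1, 1), datetime(2014, 1, 2), datetime(2014, 1, 3)])

# pd.to_datetime converts the object array to datetime64 values; wrapping
# in a Series exposes the .dt accessor for element-wise datetime fields.
days = pd.Series(pd.to_datetime(arr)).dt.day
print(days.tolist())  # [1, 2, 3]
```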
As for whether the concatenation should be functional or not, it's really a matter of preference, but as long as one cannot simply do
And just to be clear, in terms of this particular PR I think it should work to implement special-casing for categorical concatenation in _concat_compat and revert most (maybe all) other changes. |
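As a rough illustration of what such special-casing could look like, here is a sketch of a `_concat_compat`-style function (the function name and structure are illustrative, not the actual pandas implementation): when every input is a Categorical with identical categories, the integer codes can be stitched together directly; otherwise everything densifies to a plain ndarray.

```python
import numpy as np
import pandas as pd

def concat_compat_sketch(arrays):
    # Fast path: all inputs are Categoricals sharing categories/ordered,
    # so we can concatenate the codes and rebuild one Categorical.
    if all(isinstance(a, pd.Categorical) for a in arrays):
        first = arrays[0]
        if all(a.categories.equals(first.categories)
               and a.ordered == first.ordered for a in arrays):
            codes = np.concatenate([a.codes for a in arrays])
            return pd.Categorical.from_codes(
                codes, first.categories, ordered=first.ordered)
    # Fallback: densify to ndarrays (the category dtype is lost here).
    return np.concatenate([np.asarray(a) for a in arrays])
```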
And in general, I think that ideally Blocks should only maintain ref_locs and placement consistency with the data; concatenation rather belongs in array-related code.
I agree with all of your points:
updated and pushed. pls take a look.
```python
def convert(x):
    if is_categorical_dtype(x.dtype):
        return x.get_values().ravel()
    return x.ravel()
```
Why does it have to ravel everything here (and in the previous row)?
because the Categorical currently expects 1-d input (it isn't actually needed on the categorical line, but the object line does need it).
fixed
```python
    elif is_bool_dtype(dtype):
        typ = 'bool'
    if typ is not None:
        typs.add(typ)
```
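The kind-detection being discussed can be sketched as a small standalone classifier using the public `pandas.api.types` helpers (the real code lives in pandas internals; this is only a sketch of the idea that each dtype maps to a coarse label, and mixed labels force an object-dtype result):

```python
import numpy as np
import pandas as pd
from pandas.api.types import (is_categorical_dtype, is_bool_dtype,
                              is_datetime64_dtype, is_timedelta64_dtype)

def classify_dtype(dtype):
    # Map a dtype to the coarse "typ" label used when deciding how a
    # concatenation result should be typed; None means "no special case".
    if is_categorical_dtype(dtype):
        return 'category'
    elif is_datetime64_dtype(dtype):
        return 'datetime'
    elif is_timedelta64_dtype(dtype):
        return 'timedelta'
    elif is_bool_dtype(dtype):
        return 'bool'
    return None
```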
I wonder if this code could be merged with the type detection in core.internals.get_empty_dtype_and_na, since the latter can be thought of as a "forecast" of the dtype of an array-like concatenation (special-casing untyped null array-likes).
I agree it could - want to take it?
or maybe push to another issue
Let's make it another issue then.
Looks solid. I mean, true, there are now noticeable logic blocks in concat_compat that could be moved to sub-functions, but I'm not sure it is that bad for readability. Maybe summon a fresh pair of eyes? :)
Yeah, that's much better. Seems a tad slower because of all the stuff the Series ctor does, but definitely looks more intuitive to me.
@immerrr ok, finally fixed this. We could dispatch to the sub-types to avoid having all of this code in common, e.g. Categorical, SparseArray (maybe in tseries/common.py for timedelta/datetime)?
@immerrr I fixed the impl to use my suggestion above, much cleaner IMHO.
Much cleaner indeed. Although I'm still a bit worried that this PR makes a few too many corollary changes (e.g. bool handling) without any tests for those changes.
ok, going to delay this. e.g. the bool and int casting is a special case where np.concatenate does the wrong thing and np.find_common_type is basically useless, so prob need a systematic test of:
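The bool/int problem can be seen directly in numpy: concatenating a bool array with an int array silently upcasts the booleans to 0/1 integers under numpy's common-type rules, whereas preserving both scalar types requires falling back to object dtype:

```python
import numpy as np

b = np.array([True, False])
i = np.array([1, 2])

# numpy's common-type rules say bool + int -> int, so np.concatenate
# silently turns the booleans into 0/1 integers.
upcast = np.concatenate([b, i])
print(upcast.tolist(), upcast.dtype.kind)  # [1, 0, 1, 2] i

# Casting to object first keeps the original True/False scalars intact.
preserved = np.concatenate([b.astype(object), i.astype(object)])
print(preserved.dtype)  # object
```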
@immerrr I updated, hopefully catching all of these empty-empty combo cases. pls have a look.
ok, I think fixed up. the only weird one is lmk
Added a couple more nitpicks, but generally looks fine.
BUG: concat of series of dtype category converting to object dtype (GH8641)
closes #8641
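A minimal reproduction of the behavior this PR addresses: concatenating two category Series with the same categories should keep the category dtype rather than decaying to object (on a build with the fix, the dtype survives).

```python
import pandas as pd

s1 = pd.Series(['a', 'b'], dtype='category')
s2 = pd.Series(['a', 'b'], dtype='category')

# Before the fix, this produced an object-dtype Series; with the fix,
# the shared category dtype is preserved through concat.
res = pd.concat([s1, s2], ignore_index=True)
print(res.dtype)  # category
```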